Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Wed Nov 11 15:23:17 2020"
The text continues here.
How am I feeling right now?
Well, I would say excited but also confused :) I am just at the beginning of learning to use R, and using GitHub on top of that makes it more complicated, but it is also a great challenge. I think it will help me a lot to understand the work of others who use R. Besides, it is always good to learn something new!!
What am I expecting to learn?
Where did I hear about this course?
Through our UEF Yammer –> I am just a bit late, I saw the ad this Thursday morning.
https://github.com/SusiCS/IODS-project.git
Happy to be in this course
My mood at the moment
Amazing – It is working!!!!
Describe the work you have done this week and summarize your learning.
date()
## [1] "Wed Nov 11 15:23:17 2020"
Here we go again…
Integrating data from the web into R
The raw data has 183 observations and 60 variables; the wrangled data frame learning2014 is created from it. The description of the variables is available here.
For how to do that, see create_learning2014.R in my GitHub.
learning2014<- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep=",", header=T)
dim(learning2014)
## [1] 166 7
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The data file includes 166 observations and 7 variables. The variables are gender, age, attitude (global attitude towards statistics; 10 questions, displayed as the mean of the 10 questions), deep (deep-learning questions; 12 questions, displayed as the mean of the 12 questions), stra (strategic questions; 8 questions, displayed as the mean of the 8 questions), surf (surface questions; 12 questions, displayed as the mean of the 12 questions) and points (exam points). The attitude, deep, strategic and surface questions were answered on a Likert scale.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.3
library(GGally)
## Warning: package 'GGally' was built under R version 4.0.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# alpha = 0.1 lightens the colours; bins = 20 sets the histogram bins for the visual display
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.1), lower = list(combo = wrap("facethist", bins = 20)))
p
This graph shows an overview of the data distributions and correlations. First of all, more women than men participated in this data set. Most of the subjects were younger than 30, and the exam points were on average around 25. The attitude questions were answered with lower points on average by the females (~2.9) than by the males (~3.3). The deep question results are quite similar between males and females (m: 3.7; f: 3.6), and the same holds for the strategic (f: 3.6; m: 3.5) and surface questions (f: 2.8; m: 2.4).

The correlations in total (black) and separately for males and females are also displayed. There is a significant negative correlation between the surface questions and the attitude questions: the overall correlation is -0.176 with a p-value < 0.05. For the females there is no significant correlation, but for the males it is -0.374 with a p-value < 0.01. Put more simply, if the males answered the attitude questions with higher points, the surface questions were answered with lower points, and vice versa. A highly significant negative correlation (p-value < 0.001) exists between the deep questions and the surface questions (-0.324); here, too, the significance holds only for the males (-0.622). Again, if the male subjects answered the deep questions with high points, the surface questions were answered with lower points, and vice versa.

An example of a positive significant correlation is between attitude and points (0.437, p-value < 0.001): the higher the attitude score, the higher the exam points. This holds for the female subjects (0.422, p-value < 0.001) as well as for the male participants (0.451, p-value < 0.001). Another negative correlation is between the strategic questions and the surface questions (p < 0.05), but only in the overall data (-0.161).
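As a small sketch (not part of the original analysis), one of the gender-specific correlations discussed above could be tested directly; this assumes learning2014 is loaded as in the chunk above:

```r
# Hypothetical check: attitude-surf correlation among the male students only.
# cor.test() returns the estimate, confidence interval and p-value.
with(subset(learning2014, gender == "M"), cor.test(attitude, surf))
```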
my_model_p <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(my_model_p)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
This is a regression model with points as the dependent variable and attitude, strategic and surface as the explanatory variables. These were chosen because they had the highest correlations with points (see the graph above).

The first results are a summary of the residuals. The residuals are the distances from the data points to the fitted line we created. Ideally the data points should be symmetrically distributed around the line: the min and max values should be approximately the same distance from 0, the first and third quartiles should also be equally far from 0, and the median should be close to 0. The median in our model is near 0 and the 1Q and 3Q are quite equal. However, the min and max values are not similar, an indicator that the model might not be the best.

Next, the coefficients are presented. For this linear regression the formula is: **points = alpha (= intercept) + b1·attitude + b2·stra + b3·surf + e**, where e is a random error component. The estimated values (first column) are the slopes of the formula. This means, for example, that if the attitude score increased by 1 point (Likert scale 1–5), the number of exam points would increase by ~3.4. For 1 "unit" of strategic questions the exam points increase by 0.85, and for the surface questions they would decrease by ~0.6. The standard error of the estimate and the t-value (columns 2 and 3) show how the p-values were calculated. The Pr column gives the p-value for each estimated parameter. We can see that only attitude is significant in this model, so it seems that only the attitude questions have an influence on the exam points.

The R² is 0.2074, so attitude, surface and strategic questions can explain ~21% of the exam points. This is not very high. The adjusted R² is always smaller; it takes the number of variables (or better, the degrees of freedom) into account. Its value is 0.1927. The significance of this model as a whole is very high (p = 3.15e-08).
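As a worked example, the fitted formula can be used to predict exam points by hand. The coefficients are copied from the summary() output above; the chosen student profile (attitude = 4, stra = 3, surf = 2.5) is hypothetical:

```r
# Manual prediction from the fitted coefficients of my_model_p:
# points = 11.0171 + 3.3952 * attitude + 0.8531 * stra - 0.5861 * surf
11.0171 + 3.3952 * 4 + 0.8531 * 3 - 0.5861 * 2.5
## ~25.7 exam points
```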
This means that there is a relation between the exam points and the explanatory variables. The questions that arise are: Are the explanatory variables correlated with each other? (Cannot be answered yet –> I don't know how to do it in R :) ) And do we need the non-significant variables for a good model? A new model without the non-significant variables is shown in 4.
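The open question above (whether the explanatory variables are correlated with each other) can in fact be checked in R; a minimal sketch, assuming learning2014 is loaded as above:

```r
# Pairwise correlations of the explanatory variables only;
# large absolute values here would hint at multicollinearity.
cor(learning2014[, c("attitude", "stra", "surf")])
```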
my_model_p1<-lm(points ~ age + attitude + stra , data=learning2014)
summary(my_model_p1)
##
## Call:
## lm(formula = points ~ age + attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## age -0.08822 0.05302 -1.664 0.0981 .
## attitude 3.48077 0.56220 6.191 4.72e-09 ***
## stra 1.00371 0.53434 1.878 0.0621 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
This model fits better: the adjusted R² is higher than in the previous model. For the age and stra variables the p-values for the estimated slopes are around 0.1, which is not very low and actually not significant, but still better than in the first model.
my_model_p2<-lm(points ~ attitude + stra , data=learning2014)
summary(my_model_p2)
This model takes the explanatory variables attitude and stra into account to explain the exam points. The results are quite similar. The adjusted R² is a little bit higher at 0.1951, which means the model could be a bit better than my_model_p. The surface questions probably have no, or only a very low, influence on the exam points. The strategic questions now have a slightly larger estimated effect (0.9137) on the exam points, as does attitude (3.467). Moreover, the strategic estimate has a lower p-value (0.0621) than before. The overall p-value of 7.734e-09 is also better compared to the first model my_model_p.
my_model_p3<- lm(points ~ attitude, data = learning2014)
summary(my_model_p3)
This is now the model with only one explanatory variable, attitude. The residual summary is worse compared to the other models. The estimated value for attitude is now 3.5255: the exam points increase (improve) by about 3.5 if the mean attitude score increases by 1 point. There is a relation between the two variables because attitude is highly significant (p-value < 0.001). The R² is lower compared to the other models; it says that 19% of the exam points can be explained by attitude. However, we should consider the adjusted R². It is scaled by the number of parameters in the model: if we added more and more variables, the R² would also increase, but that would not tell us whether these variables actually contribute to the dependent variable (whether they have a relation to it). The adjusted R² is calculated by taking the degrees of freedom (and thus the number of explanatory variables) into account, so the apparent power of the model is reduced.
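The adjusted R² described above can also be computed by hand from R², the sample size n and the number of predictors p; a small sketch (the function name adj_r2 is my own):

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2(0.2074, n = 166, p = 3)
## ~0.1927, matching the summary() of my_model_p above
```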
As a conclusion, model_p1 (age, attitude, stra) is the best-fitting model if we look at the adjusted R². The second best would be model_p3 (only the attitude variable). Therefore both models will be graphically displayed in task 5.
my_model_gra <- lm(points ~ attitude + stra, data = learning2014)
par(mfrow = c(2,2))
plot(my_model_gra, which = c(1,2,5),main = "Graphical model validation")
my_model_gra_alt <- lm(points ~ attitude + stra + age, data = learning2014)
par(mfrow = c(2,2))
plot(my_model_gra_alt, which = c(1,2,5),main = "Graphical model validation 2")
my_model_gra_alt1 <- lm(points ~ attitude, data = learning2014)
par(mfrow = c(2,2))
plot(my_model_gra_alt1, which = c(1,2,5),main = "Graphical model validation 3")
The assumptions of the linear regression:
The “Residuals vs Fitted” graph shows the residuals and whether they have non-linear patterns. Here they look random, which means there is an equal error variance, so this assumption is fulfilled. This graph is also used to check for independent observations: we can see no correlation in it, so this assumption is also fine.
Normal distribution is checked with a normal Q-Q plot, which shows whether the errors/residuals are normally distributed. This is not quite the case in our data set: at the beginning and at the end there are some outliers that do not fit the line, so this assumption is not fully fulfilled. Our data show a slight negative skew. However, we should consider whether these outliers are influential for the model; this will be checked with the last plot.
The Residuals vs Leverage plot shows any influential cases. Outliers do not always have to be influential in a regression model. Cases outside the Cook’s distance line are outliers that would influence the regression model and should possibly be excluded. Here we cannot even see the Cook’s distance line in the graph, which means we have no influential outliers.
Taken all together, the model is OK, but it does not explain the exam points very well (20% can be explained by the attitude and strategic questions). Probably better models should be applied. For me the best model is p1, with the explanatory variables attitude, age and stra.
Feeling of the week
1./2. Reading the data
alc <- read.csv("https://github.com/rsund/IODS-project/raw/master/data/alc.csv")
To look at the dimensions and structure of the data. For a description of the variables see here.
dim(alc)
## [1] 370 51
str(alc)
## 'data.frame': 370 obs. of 51 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 15 15 15 15 15 15 15 15 15 15 ...
## $ address : chr "R" "R" "R" "R" ...
## $ famsize : chr "GT3" "GT3" "GT3" "GT3" ...
## $ Pstatus : chr "T" "T" "T" "T" ...
## $ Medu : int 1 1 2 2 3 3 3 2 3 3 ...
## $ Fedu : int 1 1 2 4 3 4 4 2 1 3 ...
## $ Mjob : chr "at_home" "other" "at_home" "services" ...
## $ Fjob : chr "other" "other" "other" "health" ...
## $ reason : chr "home" "reputation" "reputation" "course" ...
## $ guardian : chr "mother" "mother" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 2 1 2 2 2 1 ...
## $ studytime : int 4 2 1 3 3 3 3 2 4 4 ...
## $ schoolsup : chr "yes" "yes" "yes" "yes" ...
## $ famsup : chr "yes" "yes" "yes" "yes" ...
## $ activities: chr "yes" "no" "yes" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "yes" "yes" "no" "yes" ...
## $ romantic : chr "no" "yes" "no" "no" ...
## $ famrel : int 3 3 4 4 4 4 4 4 4 4 ...
## $ freetime : int 1 3 3 3 2 3 2 1 4 3 ...
## $ goout : int 2 4 1 2 1 2 2 3 2 3 ...
## $ Dalc : int 1 2 1 1 2 1 2 1 2 1 ...
## $ Walc : int 1 4 1 1 3 1 2 3 3 1 ...
## $ health : int 1 5 2 5 3 5 5 4 3 4 ...
## $ n : int 2 2 2 2 2 2 2 2 2 2 ...
## $ id.p : int 1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
## $ id.m : int 2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
## $ failures : int 0 1 0 0 1 0 1 0 0 0 ...
## $ paid : chr "yes" "no" "no" "no" ...
## $ absences : int 3 2 8 2 5 2 0 1 9 10 ...
## $ G1 : int 10 10 14 10 12 12 11 10 16 10 ...
## $ G2 : int 12 8 13 10 12 12 6 10 16 10 ...
## $ G3 : int 12 8 12 9 12 12 6 10 16 10 ...
## $ failures.p: int 0 0 0 0 0 0 0 0 0 0 ...
## $ paid.p : chr "yes" "no" "no" "no" ...
## $ absences.p: int 4 2 8 2 2 2 0 0 6 10 ...
## $ G1.p : int 13 13 14 10 13 11 10 11 15 10 ...
## $ G2.p : int 13 11 13 11 13 12 11 10 15 10 ...
## $ G3.p : int 13 11 12 10 13 12 12 11 15 10 ...
## $ failures.m: int 1 2 0 0 2 0 2 0 0 0 ...
## $ paid.m : chr "yes" "no" "yes" "yes" ...
## $ absences.m: int 2 2 8 2 8 2 0 2 12 10 ...
## $ G1.m : int 7 8 14 10 10 12 12 8 16 10 ...
## $ G2.m : int 10 6 13 9 10 12 0 9 16 11 ...
## $ G3.m : int 10 5 13 8 10 11 0 8 16 11 ...
## $ alc_use : num 1 3 1 1 2.5 1 2 2 2.5 1 ...
## $ high_use : logi FALSE TRUE FALSE FALSE TRUE FALSE ...
## $ cid : int 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...
This matched data set now contains 370 observations and 51 variables. The data are about student achievement in secondary education at 2 schools. It shows from which school the pupils are, their age, sex, whether they live in an urban or rural area, etc., and it also contains their alcohol consumption and whether it is high or not.
3. Hypotheses about alcohol use
The use of alcohol can have a certain number of reasons. For my analysis I will use gender, class failures, the quality of the family relationship, and going out with friends. Gender is interesting because males and females handle alcohol consumption in different ways: females usually drink more often with friends, whereas males also drink more often alone, and the drinks are usually different as well (males drink more beer and spirits, females more sparkling wine, wine and long drinks). So males drink more than females (my hypothesis :) ). Class failures could also be interesting: the more failures, the stronger the tendency to drink. If the family is a supportive and cosy environment (quality of the family relationship), the alcohol consumption should be lower (fewer wrong friends, stronger resilience). And of course, if you go out with friends, the tendency to drink will probably be higher, even when you do not have a good family life.
4. Testing the hypotheses
Bar plots
library(tidyr); library(dplyr); library(ggplot2)
## Warning: package 'tidyr' was built under R version 4.0.3
## Warning: package 'dplyr' was built under R version 4.0.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
Unfortunately, the graphs are hard to read at this size. Still, it is shown that most of the students have no failures; just a few have 1 or 2 past class failures. Most of the students have good family relationships (4) or even very good ones (5). The variable goout looks quite normally distributed; most of them go out moderately (3). In this data file a slightly higher number of female students participated.
Cross tables
alc %>% group_by(sex, high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups: sex [2]
## sex high_use count
## <chr> <lgl> <int>
## 1 F FALSE 154
## 2 F TRUE 41
## 3 M FALSE 105
## 4 M TRUE 70
alc %>% group_by(failures, high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'failures' (override with `.groups` argument)
## # A tibble: 8 x 3
## # Groups: failures [4]
## failures high_use count
## <int> <lgl> <int>
## 1 0 FALSE 238
## 2 0 TRUE 87
## 3 1 FALSE 12
## 4 1 TRUE 12
## 5 2 FALSE 8
## 6 2 TRUE 9
## 7 3 FALSE 1
## 8 3 TRUE 3
alc %>% group_by(famrel, high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'famrel' (override with `.groups` argument)
## # A tibble: 10 x 3
## # Groups: famrel [5]
## famrel high_use count
## <int> <lgl> <int>
## 1 1 FALSE 6
## 2 1 TRUE 2
## 3 2 FALSE 9
## 4 2 TRUE 9
## 5 3 FALSE 39
## 6 3 TRUE 25
## 7 4 FALSE 128
## 8 4 TRUE 52
## 9 5 FALSE 77
## 10 5 TRUE 23
alc %>% group_by(goout, high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'goout' (override with `.groups` argument)
## # A tibble: 10 x 3
## # Groups: goout [5]
## goout high_use count
## <int> <lgl> <int>
## 1 1 FALSE 19
## 2 1 TRUE 3
## 3 2 FALSE 82
## 4 2 TRUE 15
## 5 3 FALSE 97
## 6 3 TRUE 23
## 7 4 FALSE 40
## 8 4 TRUE 38
## 9 5 FALSE 21
## 10 5 TRUE 32
If we look at the crosstables, which show the relationship between high use of alcohol and the other variables, we see that more female students drink little (154 females to 105 males) and more male students drink a lot (70 males to 41 females) –> hypothesis fulfilled. For the failures, students with no fails have the highest count of low alcohol use (238) but also of high alcohol use (87), so at first sight the hypothesis is not fulfilled; but if we look at the proportions within each category, the share of high use grows with the number of failures – hypothesis fulfilled. For the family relation: the higher the quality of the relationship, the higher the number of low-alcohol drinkers, but the count of high users also peaks at good relationships (grade 4, 52 students), which likewise has the highest number of low-alcohol drinkers (128). So at first sight the hypothesis seems not fulfilled. However, if we look at the TRUE/FALSE proportions within each grade, it looks different: when the relationship is not good, roughly 1/4 to 1/2 drink a lot, and this share gets lower the higher the quality of the relationship is – hypothesis fulfilled when each category is viewed separately. For how often students go out, we can see that the more the students go out, the more they drink – hypothesis fulfilled.
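The within-category proportions discussed above do not have to be estimated by eye; a sketch, assuming alc is loaded as above:

```r
# Share of high users within each goout level (rows sum to 1);
# margin = 1 normalizes over the rows, i.e. within each goout category.
prop.table(table(alc$goout, alc$high_use), margin = 1)
```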
Box plots
fail<- ggplot(alc, aes(x = high_use, y = failures))+ geom_boxplot() + ggtitle("Relation of failures and alcohol consumption")
fail
fail1<-ggplot(alc, aes(x = high_use, y = failures, col = sex))+ geom_boxplot() + ggtitle("Relation of failures, alcohol consumption and sex")
fail1
fam<- ggplot(alc, aes(x = high_use, y = famrel))+ geom_boxplot() + ggtitle("Relation of quality in family relationship and alcohol consumption")
fam
fam1<-ggplot(alc, aes(x = high_use, y = famrel, col = sex))+ geom_boxplot() + ggtitle("Relation of quality in family relationship, alcohol consumption and sex")
fam1
go<- ggplot(alc, aes(x = high_use, y = goout))+ geom_boxplot() + ggtitle("Relation of going out with friends and alcohol consumption")
go
go1<-ggplot(alc, aes(x = high_use, y = goout, col = sex))+ geom_boxplot() + ggtitle("Relation of going out with friends, alcohol consumption and sex")
go1
Here are the boxplots, which give an overall view of the data. For the failures, the boxplot does not seem suitable (I don't know why…), so no statement can be made. For the family relation we see that the higher the family relationship was ranked, the less alcohol was drunk; the hypothesis can be confirmed. It also seems that more male students have a good relationship with their family. Likewise, the more the students go out, the more alcohol they drink, and the male students both go out more and drink more. Hypothesis fulfilled.
5. Logistic regression
Model + odds ratios and confidence intervals (CI)
m <- glm(high_use ~ goout + famrel + sex+ failures, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ goout + famrel + sex + failures, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7648 -0.7602 -0.5209 0.8660 2.4221
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3222 0.6487 -3.580 0.000344 ***
## goout 0.7663 0.1230 6.230 4.65e-10 ***
## famrel -0.4178 0.1407 -2.969 0.002991 **
## sexM 0.9495 0.2583 3.676 0.000237 ***
## failures 0.4754 0.2237 2.126 0.033544 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 374.44 on 365 degrees of freedom
## AIC: 384.44
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) goout famrel sexM failures
## -2.3222263 0.7663065 -0.4178083 0.9495276 0.4754263
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.09805504 0.02652005 0.3401879
## goout 2.15180388 1.70202235 2.7597607
## famrel 0.65848847 0.49780462 0.8659968
## sexM 2.58448845 1.56615523 4.3203098
## failures 1.60869986 1.04482774 2.5276864
The summary of the model first shows a summary of the residuals. This looks fine, since they are close to being centered on 0 and roughly symmetrical. The coefficients are shown afterwards, including the standard error, z value and the p-value of z. All chosen variables are significant, which means they help explain the high_use variable. The highest significance is for male students and goout. We can say that for one more unit (answering the question by 1 point more) the log-odds of high_use increase by 0.7663 (for goout) and 0.48 (for failures), and are about 0.95 higher for a male student; they decrease by about 0.42 if the famrel answer is increased by 1 point. The AIC (Akaike Information Criterion) is 384.44 and can be used to compare this model to other models (it is essentially the residual deviance adjusted for the number of parameters in the model).
After the model, you can find the coefficients shown as odds ratios. Odds ratios are not probabilities: they are the ratio of the odds that something happens to the odds that it does not happen, and they quantify the association between two events. If the odds ratio is greater than 1 we have a positive association, and if it is smaller than 1 a negative one. For our model we can say that failures, goout and being male have a positive association with high use of alcohol, whereas the quality of the family relationship is negatively associated with it. These results agree with the earlier hypotheses.
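The connection between a coefficient and its odds ratio can be illustrated with the goout coefficient from the model above; converting odds back to a probability is my own addition (the helper name odds_to_prob is hypothetical):

```r
# An odds ratio is just exp() of the log-odds coefficient:
exp(0.7663065)          # goout -> OR ~2.15, as in the table above
# Odds can be converted back to a probability: p = odds / (1 + odds)
odds_to_prob <- function(odds) odds / (1 + odds)
odds_to_prob(2.15)      # ~0.68
```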
Because failures is not as highly significant as the others, here is the logistic regression model without it:
m1 <- glm(high_use ~ goout + famrel + sex, data = alc, family = "binomial")
summary(m1)
##
## Call:
## glm(formula = high_use ~ goout + famrel + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5998 -0.7657 -0.5342 0.8074 2.5563
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.2863 0.6453 -3.543 0.000396 ***
## goout 0.7979 0.1226 6.506 7.71e-11 ***
## famrel -0.4350 0.1400 -3.107 0.001892 **
## sexM 0.9906 0.2561 3.867 0.000110 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 379.11 on 366 degrees of freedom
## AIC: 387.11
##
## Number of Fisher Scoring iterations: 4
Here you can see that the AIC is slightly higher (387.11 vs. 384.44), which by itself actually indicates a slightly worse fit. Nevertheless, since failures was only weakly significant, I will use these three variables for the predictive model.
6. Predictive model
m <- glm(high_use ~ goout + famrel + sex, data = alc, family = "binomial")
probabilities <- predict(m, type = "response")# predict() the probability of high_use
alc <- mutate(alc, probability = probabilities)# add the predicted probabilities to 'alc'
alc <- mutate(alc, prediction = probability > 0.5)# use the probabilities to make a prediction of high_use
select(alc, goout, famrel, sex, high_use, probability, prediction) %>% tail(10)
## goout famrel sex high_use probability prediction
## 361 3 5 M TRUE 0.25406225 FALSE
## 362 3 4 M TRUE 0.34478448 FALSE
## 363 1 4 M TRUE 0.09640288 FALSE
## 364 4 3 M TRUE 0.64356587 TRUE
## 365 2 3 M FALSE 0.26797363 FALSE
## 366 3 4 M TRUE 0.34478448 FALSE
## 367 2 4 M FALSE 0.19155369 FALSE
## 368 4 4 M TRUE 0.53888551 TRUE
## 369 4 5 M TRUE 0.43065938 FALSE
## 370 2 4 M FALSE 0.19155369 FALSE
table(high_use = alc$high_use, prediction = alc$prediction)# creating a confusion matrix with actual values
## prediction
## high_use FALSE TRUE
## FALSE 241 18
## TRUE 62 49
(241+49)/(241+49+18+62)#accuracy of the model
## [1] 0.7837838
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins # creating a confusion matrix with predicted values
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65135135 0.04864865 0.70000000
## TRUE 0.16756757 0.13243243 0.30000000
## Sum 0.81891892 0.18108108 1.00000000
The first table (the confusion matrix with counts) shows that 241 students who actually do not drink much alcohol were also predicted as such by the model. 18 cases were wrongly predicted: these 18 students actually drink less alcohol, but the model predicted high alcohol consumption. For high use, the model predicted 49 students correctly, but 62 students were classified as low alcohol use although they are high users, which is wrong. As a result we can say the model is accurate 78.4 percent of the time. The second table shows the same confusion matrix as proportions.
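Besides accuracy, sensitivity and specificity can be read off the same confusion matrix; a sketch using the counts reported above (the variable names are my own):

```r
# Cells of the confusion matrix above:
TN <- 241; FP <- 18; FN <- 62; TP <- 49
(TP + TN) / (TP + TN + FP + FN)  # accuracy ~0.78
TP / (TP + FN)                   # sensitivity ~0.44: high users correctly found
TN / (TN + FP)                   # specificity ~0.93: low users correctly found
```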
Visualizing the probabilities
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))+ geom_point() + ggtitle("logistic regression model with the variable famrel, sex and goout")
g
This is the visualization of the model.
Training error
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability) # compute the average number of wrong predictions in the (training) data
## [1] 0.2162162
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc)) # K-fold cross-validation, cv=training data
cv$delta[1]# average number of wrong predictions in the cross validation
## [1] 0.2216216
To see how good the model is, it is tested with cross-validation: here K = nrow(alc), i.e. leave-one-out cross-validation, where the model is repeatedly fitted on all but one observation and then tested on the left-out one. The proportion of wrong predictions is about 0.22 (22%), which is quite OK.
Bonus question
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability) # compute the average number of wrong predictions in the (training) data
## [1] 0.2162162
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10) # K-fold cross-validation, cv=training data
cv$delta[1]# average number of wrong predictions in the cross validation
## [1] 0.2351351
My model has a better test set performance, because its 10-fold cross-validation error is about 0.24, compared to 0.26 in the DataCamp exercise.
Well, this week…
(more chapters to be added similarly as we proceed with the course!)